Scott Cole
8 May 2017
This notebook covers how to use Python to analyze the relationship between multiple input variables and one output variable using multiple linear regression.
Outline:
1. Simple linear regression, and what happens when its assumptions (normally distributed errors, independent samples) are violated
2. Correlating individual burrito dimensions with overall satisfaction
3. Multiple linear regression of burrito dimensions onto overall satisfaction
4. Collinearity between features, and regressing one feature out of another
We will be using the library statsmodels to run our linear regression
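As a preview, here is a minimal sketch of what a multiple regression looks like with the statsmodels formula API (the dataframe and column names below are made up purely for illustration):
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: two predictors (x1, x2) and one outcome (y)
df_demo = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                        'x2': [2, 1, 4, 3, 5],
                        'y': [1.5, 2.0, 4.5, 4.0, 6.0]})

# 'y ~ x1 + x2' means: model y as a weighted sum of x1 and x2, plus an intercept
fit_demo = smf.ols('y ~ x1 + x2', data=df_demo).fit()
print(fit_demo.params)    # intercept and coefficient estimates
print(fit_demo.rsquared)  # fraction of variance explained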
In [1]:
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd
import scipy as sp
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib.pyplot as plt
In [2]:
# Define true statistics relating x and y
N_points = 10
true_beta0 = 0
true_beta1 = 2
noise_stdev = 1
# Set random seed
np.random.seed(0)
# Generate correlated data
x = np.random.randn(N_points) + 2
y = true_beta0 + true_beta1*x + np.random.randn(N_points)*noise_stdev
print('x=', x)
print('y=', y)
In [3]:
# Plot x and y
plt.figure(figsize=(4,4))
plt.plot(x, y, 'k.', ms=12)
plt.xlabel('x',size=15)
plt.ylabel('y',size=15)
Out[3]:
In [4]:
# Fit line to data
A = np.vstack([x, np.ones(len(x))]).T
m, b = np.linalg.lstsq(A, y, rcond=None)[0]
print('True statistics: y =', true_beta1, '*x +', true_beta0)
print('Estimated stats: y =', m, '*x +', b)
print('R squared (fraction of variance explained) =', np.round(sp.stats.pearsonr(x, y)[0]**2, 2))
In [5]:
# Plot fitted line
plt.figure(figsize=(4,4))
plt.plot(x, y, 'k.', ms=12)
plt.plot([0,5], [true_beta1*xi + true_beta0 for xi in [0,5]], 'k--', label='True correlation')
plt.plot([0,5], [m*xi + b for xi in [0,5]], 'r--', label='Estimated correlation')
plt.xlabel('x',size=15)
plt.ylabel('y',size=15)
plt.xlim((0,5))
plt.legend(loc='best')
Out[5]:
In [6]:
# Simulate data with non-normal distribution of error
np.random.seed(1)
N_points = 100
x = np.random.randn(N_points) + 2
y = true_beta0 + true_beta1*x + np.random.randn(N_points)**2
In [7]:
# Fit line to data
A = np.vstack([x, np.ones(len(x))]).T
m, b = np.linalg.lstsq(A, y, rcond=None)[0]
print('True statistics: y =', true_beta1, '*x +', true_beta0)
print('Estimated stats: y =', m, '*x +', b)
print('R squared (fraction of variance explained) =', np.round(sp.stats.pearsonr(x, y)[0]**2, 2))
In [8]:
# Plot fitted line
plt.figure(figsize=(4,4))
plt.plot(x, y, 'k.', ms=8)
plt.plot([0,5], [true_beta1*xi + true_beta0 for xi in [0,5]], 'k--', label='True correlation')
plt.plot([0,5], [m*xi + b for xi in [0,5]], 'r--', label='Estimated correlation')
plt.xlabel('x',size=15)
plt.ylabel('y',size=15)
plt.xlim((0,5))
plt.legend(loc='best')
Out[8]:
In [9]:
plt.figure(figsize=(8,3))
errors = y - [m*xi+b for xi in x]
hist2 = plt.hist(np.random.randn(100000)*np.std(errors), np.arange(-8,8,.5), color='r', density=True, alpha=.5, label='normal')
hist = plt.hist(errors, np.arange(-8,8,.5), color='k', density=True, alpha=.3, label='True error')
plt.legend(loc='best')
plt.xlabel('Estimate error')
plt.ylabel('Probability')
plt.yticks(np.arange(0,1.2,.2),np.arange(0,.6,.1))
Out[9]:
Linear regression works well only if samples are independent and identically distributed (IID). If this assumption is violated, then the computed correlation statistics are not reliable.
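Before getting to a burrito example of this, here is a toy sketch (with made-up numbers) of how non-independent samples can manufacture an apparently significant correlation out of pure noise:
import numpy as np
import scipy.stats as stats

np.random.seed(42)
# 20 independent, uncorrelated samples
x_iid = np.random.randn(20)
y_iid = np.random.randn(20)
r, p = stats.pearsonr(x_iid, y_iid)
print('independent samples: r =', np.round(r, 2), ', p =', np.round(p, 3))

# Append the same (x, y) observation 30 more times, as if one sample were counted repeatedly
x_dup = np.append(x_iid, np.ones(30)*2)
y_dup = np.append(y_iid, np.ones(30)*2)
r, p = stats.pearsonr(x_dup, y_dup)
print('with repeated samples: r =', np.round(r, 2), ', p =', np.round(p, 3))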
In [10]:
# Simulate burrito ratings: 50 independent burritos, plus one new burrito rated many times at the same cost
np.random.seed(0)
burrito1_cost = 6 + np.random.randn(50)
burrito1_stars = 3.5 + np.random.randn(50)*.8
burrito_new_cost = 4
burrito_new_stars = np.arange(4,5.1,.1)
# Define cost and stars arrays
c = np.append(np.ones(len(burrito_new_stars))*burrito_new_cost,burrito1_cost)
s = np.append(burrito_new_stars,burrito1_stars)
# Compute correlation
print('Statistics of random data points')
print('R squared (fraction of variance explained) =', np.round(sp.stats.pearsonr(burrito1_cost, burrito1_stars)[0]**2, 2))
print('p =', np.round(sp.stats.pearsonr(burrito1_cost, burrito1_stars)[1], 3))
print('\nStatistics after adding in the non-independent data points')
print('R squared (fraction of variance explained) =', np.round(sp.stats.pearsonr(c, s)[0]**2, 2))
print('p =', np.round(sp.stats.pearsonr(c, s)[1], 3))
In [11]:
# Fit line to data
A = np.vstack([c, np.ones(len(c))]).T
m, b = np.linalg.lstsq(A, s, rcond=None)[0]
# Plot fitted line
plt.figure(figsize=(4,4))
plt.plot(c, s, 'k.', ms=8)
plt.plot([0,10], [m*xi + b for xi in [0,10]], 'k--')
plt.xlabel('Burrito cost')
plt.ylabel('Stars')
plt.xlim((0,10))
Out[11]:
In [12]:
# Load burrito data into pandas dataframe
url = 'https://docs.google.com/spreadsheet/ccc?key=18HkrklYz1bKpDLeL-kaMrGjAhUM6LeJMIACwEljCgaw&output=csv'
df = pd.read_csv(url)
# Delete unreliable ratings
df['Unreliable'] = df['Unreliable'].map({'x': 1, 'X': 1, 1: 1})
df['Unreliable'] = df['Unreliable'].fillna(0)
df = df[df['Unreliable'] == 0]
# Rename the Meat:filling column because ':' is not allowed in statsmodels formula strings
df.rename(columns={'Meat:filling': 'Meatratio'}, inplace=True)
# Limit data to main features
df = df[['Location','Burrito','Yelp','Cost','Hunger', 'Volume', 'Tortilla', 'Temp', 'Meat',
'Fillings', 'Meatratio', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'overall']]
df.tail()
Out[12]:
In [13]:
# Define dimensions of interest
dims = ['Cost', 'Hunger', 'Tortilla', 'Temp', 'Meat',
'Fillings', 'Meatratio', 'Uniformity', 'Salsa', 'Wrap']
# Correlate each dimension to the overall satisfaction rating
results = {}
for d in dims:
    model_str = 'overall ~ ' + d
    results[d] = smf.ols(model_str, data=df, missing='drop').fit()
    print(d, ', R2 =', results[d].rsquared, ', p =', np.round(results[d].pvalues[d], 4))
In [14]:
plt.plot(df['Fillings'],df['overall'],'k.')
plt.xlabel('Nonmeat filling flavor')
plt.ylabel('overall satisfaction')
Out[14]:
In [15]:
model_str = 'overall ~ ' + ' + '.join(dims)
print(model_str)
In [16]:
results_all = smf.ols(model_str, data=df, missing='drop').fit()
print(results_all.summary())
In [17]:
dims = ['Cost','Hunger', 'Tortilla', 'Temp', 'Meat',
'Fillings', 'Meatratio', 'Uniformity', 'Salsa', 'Synergy', 'Wrap']
model_str = 'overall ~ ' + ' + '.join(dims)
results_all = smf.ols(model_str, data=df, missing='drop').fit()
print(results_all.summary())
In [18]:
dfcorr = df[dims].corr()
M = len(dims)
from matplotlib import cm
clim1 = (-1,1)
plt.figure(figsize=(12,10))
cax = plt.pcolor(range(M+1), range(M+1), dfcorr, cmap=cm.bwr)
cbar = plt.colorbar(cax, ticks=(-1,-.5,0,.5,1))
cbar.ax.set_ylabel('Pearson correlation (r)', size=30)
plt.clim(clim1)
cbar.ax.set_yticklabels((-1,-.5,0,.5,1),size=20)
ax = plt.gca()
ax.set_yticks(np.arange(M)+.5)
ax.set_yticklabels(dims,size=25)
ax.set_xticks(np.arange(M)+.5)
ax.set_xticklabels(dims,size=25)
plt.xticks(rotation='vertical')
plt.tight_layout()
plt.xlim((0,M))
plt.ylim((0,M))
Out[18]:
In [19]:
# Simulate two collinear predictors: x2 is a noisier copy of x1, and both track y
y = np.arange(1,2,.01)
x1 = y + np.random.randn(len(y))*.1
x2 = x1 + np.random.randn(len(y))*.3
In [20]:
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
plt.plot(x1,y,'k.')
plt.ylabel('y')
plt.xlabel('x1')
plt.subplot(1,2,2)
plt.plot(x2,y,'k.')
plt.xlabel('x2')
plt.figure(figsize=(4,4))
plt.plot(x1,x2,'k.')
plt.xlabel('x1')
plt.ylabel('x2')
Out[20]:
In [21]:
print('Correlation coefficient between x1 and y: ', np.round(sp.stats.pearsonr(x1,y)[0],3))
print('Correlation coefficient between x2 and y: ', np.round(sp.stats.pearsonr(x2,y)[0],3))
print('Correlation coefficient between x1 and x2: ', np.round(sp.stats.pearsonr(x1,x2)[0],3))
In [22]:
# Regress out features
def regress_out(x, y):
    """Regress x out of y and return the residuals of y"""
    A = np.vstack([x, np.ones(len(x))]).T
    m, b = np.linalg.lstsq(A, y, rcond=None)[0]
    return y - b - x*m
x2b = regress_out(x1, x2)
In [23]:
# Visualize relationships with x2 after regressing out x1
plt.figure(figsize=(4,7))
plt.subplot(2,1,1)
plt.plot(x2b,x1,'k.')
plt.ylabel('x1')
plt.subplot(2,1,2)
plt.plot(x2b,y,'k.')
plt.ylabel('y')
plt.xlabel('x2 after regressing out x1')
Out[23]:
In [24]:
print('After regressing out x1 from x2:')
print('Correlation coefficient between x2 and y: ', np.round(sp.stats.pearsonr(x2b,y)[0],3))
print('Correlation coefficient between x1 and x2: ', np.round(sp.stats.pearsonr(x1,x2b)[0],3))
In [25]:
df = pd.DataFrame.from_dict({'x1':x1, 'x2':x2, 'y':y})
results_all = smf.ols('y ~ x1 + x2', data=df).fit()
print(results_all.summary())